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Abstract 

In this paper we discuss different approaches for exploiting parallelism in 
the ICCG method for solving large sparse symmetric positive definite systems 
of equations on a shared memory parallel computer. Techniques for efficiently 
solving triangular systems and computing sparse matrix- vector products are 
explored. Three methods for scheduling the tasks in solving triangular systems 
are implemented on the Sequent Balance 21000. Sample problems that are rep- 
resentative of a large class of problems solved using iterative methods are used. 
We show that astatic analysis to determine data dependences in the triangular 
solve can greatly improve its parallel efficiency. We also show that ignoring 
symmetry and storing the whole matrix can reduce solution time substantially. 
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1 Introduction 


We explore different schemes for exploiting the parallelism available in the ICCG 
method for solving large sparse systems of linear equations on a shared memory 
computer. All of this work has been conducted on a 12 processor Sequent Bal- 
ance 21000. We have looked at the efficient implementation of methods for solving 
triangular systems and at sparse matrix vector multiplication. 

An important difficulty in solving general sparse triangular systems is that the 
available parallelism depends on the zero structure of the matrix, and is therefore not 
known at compile time. The concurrency is data dependent and can be determined 
only at run time. We show that bv performing a small amount of analysis to 
determine the data dependences one can drastically improve the parallel efficiency. 
We permute (reorder) the index set of the recurrence equation for the triangular 
solve and put the indices in a queue. The processors repeatedly take indices from 
the queue, perform the associated calculations, and then take another index until all 
unknowns have been computed. Data dependences are resolved by semaphores . A 
semaphore is a variable that can be operated upon only by synchronizing primitives. 
We check indices in a shared array that indicate whether each of the unknowns has 
been computed. If a calculation depends on a piece of data and an entry in the 
shared array indicates that it has not been computed then the processor performing 
the calculation must busy wait . Busy waiting is when a processor loops waiting for 
a flag to change value. 

Also, we show that there is a tradeoff between storing the lower triangular part 
of a symmetric matrix and storing the entire matrix. Storing the lower part to save 
storage complicates the multiplication since both outer products (which require syn- 
chronization) and inner products must be performed. The synchronization overhead 
slows down this operation. 

For our experiments we work with systems of equations in the form they are 
presented. We do not consider the problem of reordering the rows and columns to 
enhance parallelism. 

The rest of this paper is organized as follows. Section 2 review's related research. 
Section 3 contains a brief discussion of the ICCG method. Section 4 discusses how 
the dependence graph is used to exploit the parallelism in solving sparse triangular 
systems. Section 5 contains numerical experiments that show it is more efficient to 
store the whole symmetric matrix than only the upper or lower triangular part. In 
Section 6 we compare solving a lower triangular system by inner products versus 
solving by outer products. Section 7 presents the efficiency of the ICCG method 
using the techniques described in the previous sections. Section S dicusses other 
scheduling methods not used in this paper. Section 9 contains remarks and conclu- 
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sion s. Appendix A describes the 7 lest, eases used in t lie experiments. Appendix 
B discusses the architecture of the Sequent Balance 21000 and provides times for 
arithmetic operations and synchronization primitives. In Appendix C we show how 
the time to access array elements increases ns a function of the array size on the 
Sequent Balance 21000. 

2 Related Work 

Level scheduling methods are considered in [2.12.27]. Anderson [2] compares two 
different scheduling methods for solving sparse triangular systems on the Alliant 
FX/S. a shared memory machine. They ar p Jonranl it n l scht dating* in which each 
unknown in the triangular solve is computed as early as possible, and backward heel 
scheduling . in which each unknown is computed at the latest possible time. A level 
scheduling approach partitions the loop of the recurrence equation into a sequence 
of fully parallelized do loops (levels) separated by global synchronizations. He shows 
that the overhead in scheduling tasks to he performed as late as possible is not worth 
the time savings. 

Baxter ft. til. [3] compare ten l scheduling with a st If .scheduling method using a 
shared memory computer, an Encore Muhimax/220. I he srlj scheduling met hod is 
a two step procedure to parallelize the recurrence 1 equation of the triangular solve. 
First, one performs a topological sort of the dependence graph to permute the index 
loop. Xext, statically assign elements of the index set to the processors of the 
system. Global synchronizations are avoided by requiring processors to write into 
specified locations of shared arrays when work on a particular index is completed. 
Before a variable can be used, a processor makes sure that the appropriate values 
have been calculated by hustj waiting on a designated value in the shared memory. 
They show that If .scheduling performs belter than if cel schi dul'nig tor all but one 
of their lest cases. 

The work of Salt/ < i. <d. [23] is similar to the work <>1 Baxter. 1 hey also compare 
leirl .scheduling and self sfdud tiling o\\ an Encore ami reach similar conclusions. Salt/ 
also proposes a new programming construct, doconsider which allows compilers to 
parallelize many problems in which substantial loop-level parallelism is available but 
cannot be detected by standard compile-time analysis. 

The difference between the work preset! ted here and previous work on triangular 
systems is that we use dynamic scheduling to assign tasks to processors and the 
others use static scheduling. 

In this paper we focus on general parallel processors but others have studied 
implementations on parallel vector machines [l3.PL20.2fi]. Additionally. Sand [21; 


3 


ORIGINAL PAGE IS 
OF POOR QUALITY 


presents a survey of recent research in Krylov subspare methods with an emphasis 
on parallel ami vector implementations. 

3 ICCG Background 

[{ere we give a brief introduction to the Incomplete ( holesky Conjugate Gradient 
(kCG) method. For detailed information of its derivation and properties see ref- 
erences [6. 1 1. 13. 1 7. IS.23j. i lie Conjugate Gradient ((G) method was proposed by 
liestenes and Sliefel [13] for the solution of 

1-/' = ft. (1) 

where .1 is a given symmetric positive definite A by A matrix, ft is a given A -vector 
and r is an A - vector to be computed. 

Starting from an initial guess the CG method generates a series of approx- 
imate solutions ./■ •b'd. The convergence rate is very poor for ill-conditioned problems 
[11]* One way to improve the convergence is to pre-condition (1) premultiply it 
by a conditioning matrix and thereby condense the eigenvalue spectrum [ 1 j . 

A popular precou<lit ioner is the Incomplete (’holesky preconditioner proposed 
by Xfeijerink and \au der \orst. [IS]: they perform an approximate Chuleskv- 

factorization LL^ of l with zero fill. F<[uation ( l) now becomes: 

[iir r )L~ l s] X = ( L~ r jfr l b. (2) 

L~ l is not explicitly computed, instead triangular systems are solved. Each iteration 
of the ICCG method requires the solution of two sparse triangular systems, a sparse 
matrix vector product, 3 saxpy’s and 2 inner products. 

We warn the reader that wo use an inconsistent notation here from the rest of 
the paper. Here we subscript a vector t<> indicate that it is a member of a sequence 
rather than referring to ari individual element. The greek letters represent scalars. 
The ICCG method is below: 

i'o := 0 
r 0 := ft 

ft :=MNU 

repeat For k~ 1 . 2. ... 

Solve LL r - k ^ x - r k . { for :t_i 
c . r „ ( / yr . 

'k * — - k _ i - k - 1 / .v - > 

Pk : = 1 +"1kPk- \ 
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( h = 0 ) 

’Pi = '<>) 


(*k := 3t_i r *-i /Pfc^P* 
»^A: ■ — '** A: — 1 4" ^kPk 
r k := r^_i - o;..4p fc 
until ||rjt||oo £ ^ 


For our codes we choose £ = 10 6 so our iteration stops when the infinity norm 
of the residual is reduced by 6 orders of magnitude. 

4 Triangular Systems 

At each ICCG iteration we solve the triangular systems 


Lq = r 


(3) 


and 

L T z = q. (4) 

Together, these two operations consume between 30 % and \l% of the total cpu time 
required to solve the system on a single processor for our test cases. The percentage 
depends on the sparsity of L - the more nonzero elements in L the higher the per- 
centage. The remaining time is consumed by sparse matrix- vector products, inner 
products and saxpy’s. These are relatively easy to compute in parallel. Efficient, 
parallel computation of the triangular solves is necessary to accelerate the entire 
computation. 

The system (3) is solved by 


<7i = 


Lu 


i = L, 


.V 


(■>) 


In the dense case, each q t depends on all j = 1 i — L. When L is sparse, 

each q t depends on a few other q y Another way to look at it is that once some q } 
has been computed, several other q s may be computed in parallel. It is possible to 
perform some simple analysis of the data dependences to determine which elements 
of q can be computed in parallel and determine which qj ' s each q t depends on. This 
information can be utilized to schedule tasks. For example, if //, depends on q s then 
q 3 should be scheduled before q t . Also, if qj and q k are independent tasks then we 
may schedule them to be computed in parallel. 
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Figure L: Sparsity Structure of L 

4.1 Computing the Dependence Graph 

The problem one faces when exploiting this type of parallelism is that it is data 
dependent and can only be recognized at run time, not at compile time. It depends 
entirely on the sparsity structure of L. L is usually read in as input or computed 
at an earlier stage of the program. The focus in this section and the next one is 
on lower triangular systems. A similar analysis can be done on upper triangular 
matrices. 

Consider solving ( 3 ) where L has the sparsity structure shown in Figure 1. 
Analysis of the structure of L enables us to construct a corresponding directed 
graph (digraph), the dependence graph G{L) — (V.E). There are N vertices, 
V = {1, . A"}, corresponding to the A rows of L (and A r elements of q). A 
nonzero element, at l UJ means that q t depends on qj; i.e.. q 3 must be calculated 
before q,. Therefore, we define the edges of G(L) as follows: E = {(/, i) | li tJ ^ 0}. 
We ignore the loops corresponding to the diagonal elements of L ( G{ L ) is acyclic). 
The depth of a vertex r t is 0 if it has no predecessors otherwise the depth of v t is the 
length of the longest directed path in G(L) whose origin is a vertex of depth 0 and 
terminus is v t . 

The dependence graph of L is shown in Figure 2. All nodes at depth 0 can 
be computed immediately. They have no dependences. q\ and q 3 can be solved 
directly. Once q 3 is computed we can solve for 74. After q\ is computed we ran solve 
for 72 and % in parallel. The unknowns q 2 and q$ depend only on 71. Once 72. qe 
and 74 have been computed, we can solve for 7.5, 77 in parallel. Vertices that have 
equal depth represent independent tasks. The fact that 72 and 76 can be computed 
as soon as has been computed, even if 73 has not been completed, illustrates 
the difference between It re l scheduling methods and self scheduling methods, level 


6 


ORIGINAL PAGE IS 
OF POOR QUALITY 


Figure 2: Dependence Graph of L 


scheduling computes tasks corresponding to vertices of equal depth in G in parallel. 
All tasks at a certain depth must be completed before tasks at the next level can be 
started. Global synchronizations are used to separate tasks at different depths, self 
scheduling allows tasks to start as soon as their associated dependences have been 
computed. 

4.2 Permuting the Index Set 

The index set for the sequential solution of equation (5) is i — l, .... A’. To exploit 
the parallelism in the forward solve we reorder the index set according to the depth 
of each index in the dependence graph. A vertex of a certain depth is put in the 
permuted set before all vertices of greater depth. We define postion(k) to be the 
number of elements in the premuted index set that precede k. If two vertices r t and 
Vj have equal depth then we put i\ in the permuted index set before Vj if 

maxi position(n)) < max(position{m)) such that ( n, / ), ( m, j) 6 E. 

ri m 

if 

max(position[n)) = ma x( posit ion[rn)) such that ( n, i).{ m. j) € F. 

n m 

then n = m and is placed in the list before j. This is a side effect of the 

sequential traversal of the data structure for L. 
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We rail this permuted index set fwcLschedule. For example, the permuted 
index set from the dependence graph in Figure 2 is 

fwd_schedule = {[, 3, 2, 6, 4, 5, 7, 8} 

Xote that 6 appears before 4. 6 is a descendent of 1 and 4 is a descendent of 3 and 
position(l) < position^ 3). 

One way to compute the f wd-schedule list is outlined here. First, as the matrix 
is assembled or read in, construct an array of length A\ called the ready array such 
that ready[i] is the number of nonzero elements in row i of L, We then scan the 
entries of the ready array looking for entries with a value of 1. If recidy(i) = 1 then q x 
can be solved for directly. This entry is put in a queue, Q , When we have inserted 
all entries with value 1 in Q we start the following loop. We follow the notation 
used in [i] for operations on a queue. A queue is a special kind of a list, where items 
are inserted at one end (the rear) and deleted from the other end (the front), 

fwd.schedule = nil 
While (e mpty(O) / true) 

L. i :=■ front(O) 

2. deque ue{Q) 

3. append i to fwd.schedule list 

4. for each nonzero element Lk t 

(a) ready(k) := recidy(k) — l 

(b) if (ready(k) = 1) then enqueue(Q, k) 

The dependence graph is not explicitly computed but the information it repre- 
sents is implicit in ready and the ordering of fwd.schedule. We require two integer 
arrays of length .V to hold fwd_schedule and ready . This additional storage is small 
relative to the storage for .4. L and the other .V-vectors needed for ICCG. 

Equation (4) is also solved with a permuted index set. which we store in the array 
back-schedule [] . It is computed by analyzing the dependence graph of G{ L r ), 
in a manner similar to that used to compute f wd_schedule. Let G(L T ) = {If. Ej), 
\f = V and Ej = {[j, i)\(i, j) 6 E}, G{ L T ) is the same as G(L) with the direction 
of the edges reversed. For the example shown in Figure 1. the schedules for solving 
the upper and lower triangular systems are the reverse of each other. This is not 
true in general. Suppose that we have the same lower triangular matrix as in Figure 
I except L % and L$$ are the only nonzero elements in row s of L, Then. will 
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case 

fwd_schedule 

fwd solve 

back^schedule 

bck solve 

Ax=b 

i 

.12 

.18 

■ii 

.16 

1.89 

2 

.06 

.06 

.05 

.06 

2.45 

3 

.05 

.05 

.04 

.04 

3.07 

4 

.07 

.08 

.06 

.08 

3.21 

5 

.10 

.11 

.09 

.10 

8.85 

6 

.25 

.30 

.23 

.28 

8.52 

7 

3.12 

3.25 

2.09 

2. 18 

655.39 


Table i: Time in seconds to compute task schedules vs. single sequential triangular 
solve and solving Ax=b in parallel 


be depth 1 in G(L) and 8 will be the 6 th element in fwd_schedule[] . But, will 
be at depth 0 in G(L r ) and the first element in back-schedule [] . 

The time to compute the permuted index sets is a little less than the time to 
compute a single sequential triangular solve and a small fraction of the time to 
solve (2) in parallel. The time in seconds to compute the forward and backward 
schedules for the test cases is shown in Table 1. We compare the time to compute the 
fwd.schedule and the back-schedule lists with the time to sequentially compute 
one forward arid backward solve and with the time to solve Ax = b in parallel. 

4.3 Forward Solve 

We solve (3) as follows. L is stored by columns and the forward solve is computed 
as a set of outer products, fwd.schedule is the list of indices which correspond to 
elements of q to be computed. It is treated as a queue of tasks to be executed by 
the pool of processors. Let there by V processors. Initially, the first P indices in 
the queue are assigned one to each processor. Let i be the index a processor gets 
from the queue. Before we compute each forward solve we set fwd_ready[] to be 
the number of nonzero elements in each row of L. If fwd_ready[i] ^ 1 , then the 
processor must busy wait , else, compute r, = b t j L lA . Xext, compute r,— = h JA q t 
and decrement fwd_ready[j] for all nonzero elements j of column i of L. Finally, 
if the queue is not empty get the next task. 

For the triangular solve, we experimentally compared three different techniques 
for parallelizing the code. We call the first method dynamic scheduling (DS). The 
elements of q are assigned to processors in order, from 1 to .V. They are computed 
as soon as the data they depend on is ready to be used. The data illustrate that 
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poor performance may be expected if the index set is left in its original order. 

The second technique is due to Baxter tt. ai [3]. We call this technique 
reordered static scheduling (RSS). They use the reordering strategy above, but em- 
ploy a static assignment of tasks from the permuted index set to processors. Let 
P be the number of processors. Processor 1 < i < P . gets tasks / + P x j for 
j ~ 0, . ^ has the advantage that for every iteration each processor 
will solve for the same values of q. This characteristic is especially noticeable for 
small problems when the entire problem fits in the local memory (or cache) of the 
processors. But this may not be very good at load balancing. If there are wide 
variations in the number of nonzero elements in the rows/columns of the matrix 
then the static mapping may cause unnecessary busy waiting. This variation arises 
in many different situations: non-uniform discretizations, adaptively refined meshes, 
or mixed element types (triangular and quadrilateral elements in the same grid) for 
instance. 

The problem with reordered static scheduling is that the position of the task in 
the schedule is determined solely by its depth in the dependence graph. The strategy 
does not consider the amount of time needed to perform the task. It is possible that 
a static assignment of tasks to processors could result in uneven distribution of work 
and lower or less throughput. 

The third technique is called reordered dynamic scheduling (RDS). We reorder 
the index set as above, but we put the indices (tasks) in a queue rather than stati- 
cally mapping them to processors. The first processor done with the work initially 
assigned to it takes the next job from the front of the queue. This is done to reduce 
the time spent busy waiting due to potential load imbalance. There is an addi- 
tional expense of maintaining a global pointer (nunextO on the Sequent) to the 
first element in the queue. 

The C code for reodered dynamic scheduling is shown in Figure 3. m_next() is 
the system function which increments a global counter and returns its current value, 
fwd.schedule [] is our permuted index set for the forward solve. As suggested by 
Duff, et. ai [7], we store the columns as packed sparse vectors held contiguously 
in the array 1[]. The row numbers of the corresponding nonzero entries held in 
1 [] are held, in the integer array row_num[]. The integer array start [i] points 
to the start of column i in array 1[] containing the nonzero elements of matrix L. 
In fact 1 [start [i]] is the diagonal element L tA . The global variables unknowns 
and tot-nonzero hold the number of rows in L and the total nonzero elements in 
L respectively start [unknowns+l] = tot-nonzero + 1. 

Once we have gotten a task from the queue, we check whether all of the data 
it needs are ready. This is done by looking at the value of the fwd_ready[i] array 
containing the number of direct dependences for row i. [f the value is greater than 
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parallel. fwdslvCl, q, row.num, my.id, num.proc) 
double 

in, qC3 ; 

short int 

row.num [] ; 

int 

my.id, num.proc ; 

{ 

register double 
tmp ; 

register int 

column, row, pointer; 
int 

task, m.nextO; 


task = m.next 0 ; 

/* get pointer into queue 

*/ 

while (task <= unknowns) { 

column = f wd. schedule [task] ; 

/* get column for this task 

*/ 

bck. ready [column] * WAIT; 

/* reset for back solve 

*/ 

pointer = start [column] ; 

/* get pointer into DS 

+/ 

while (fwd .ready [column] > 1) continue; 

; /* busy wait until ready 

*/ 

q[column] /* 1 [pointer++] ; 

/* solve for our q[i] 

*/ 

tmp = q[column] ; 

/* store it in a local var 

*/ 

while ( pointer < start [column* l] ) { 
row = row.num [po inter] ; 

S.LOCK (lp [row] ) ; 

/* set lock 

*/ 

q[row] 1 [pointer] * tmp; 

/* mult q[i] by column j 

*/ 

f wd.ready [row] — ; 

/* decrement depend, vector 

♦ / 

S.UNLDCK ( Ip [row] ) ; 

/* unlock lock 

*/ 

pointer**; /* move to next nonzero element */ 

> 

task = ra.next 0 ; /* get next task 

*/ 


> 

m.syncO; /* synchronize before returning */ 

} 


I* 1 [] - nonzero elements of L */ 
/* q [] vector to be computed */ 
/* row.numEi] - row of element i */ 

/* proc’s id and # of proc’s */ 


Figure .1: Procedure for Parallel Forward Solve 
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Type 

Case 

Sequential 

DS 

RSS 

RDS 


c 

tp 

'D 

f P 

effic. 

b 

effic. 

l 

2.07 

1.61 

.15 

.73 

.34 

.73 

.32 

2 

2.80 

.78 

.30 

.55 

.42 

.77 

.30 

3 

3.29 

1.07 

.26 

.72 

.38 

.99 

.28 

■t 

3.84 

1.54 

.21 

.90 

.36 

1.03 

.31 

0 

11.32 

3.48 

.27 

2.13 

.44 

2.92 

.32 

b 

13.29 

11.04 

.10 

2.73 

.41 

3.29 

.34 

i 

665.21 

418.21 

.13 

219.46 

.25 

206.81 

.27 


Table 2: Time in seconds and efficiency of parallel forward solve on 12 processors 
relative to sequential rode 


L then we busy wait. W hen fwd_ready[i] is equal to 1 the dependences for q t have 
been satisfied and we can compute q t . We set q t - qJL x and then loop over the 
nonzero elements in column i below the diagonal, computing qj — Qj - Lj^xq^ Then 
we decrement the value of the fwd_ready[j] array to indicate that one dependence 
for qj has been satisfied. The array q and the fwd_ready array are shared and 
access to individual elements must be synchronized using the system calls S-L0CKO 
and SJJNLOCKO. These synchronization procedures are called once for each nonzero 
off-diagonal element in the lower triangular matrix each iteration. This locking and 
unlocking operation takes about half of the time in the forward solve routine when 
the matrix is stored by columns. We also reset bck_ready[] for the next back solve 
operation. 

In Table 2 we show the results for the three methods explained above on seven 
test problems. This is the time spent during the iterative solution of Ax =6 doing 
forward solves. All times are measured in seconds. We also include the sequential 
time for for each problem, t s . The sequential time given is the best sequential 
code we could write running on one processor; there are no parallel constructs or 
synchronizations used. Parallel code running on 12 processors of the Sequent took 
time t : i. We measure efficiency as 


effic. 


ts 

tp X #proc 


#proc 


12 . 


The DS timings are included for comparison to illustrate the benefit of computing 
the dependence graph and permuting the index set. We see that both RSS and 
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Case 

tnDS - iRSS 

Predicted 

# iterations 

i 

0.05 

0.06 

16 

2 

0.22 

0.11 

41 j 

3 

0.29 

0.14 

68 

4 

0.13 

0.13 

15 

5 

0.79 

0.47 

101 

6 

0.44 

0.44 

43 


Table 3: Time Difference between RDS and RSS vs. Estimated time in seconds 


RDS are significantly better than DS, sometimes more than twice as efficient. The 
RSS method performs better than RDS in all but the last problem. In the first 
Test Case, RSS and RDS the two took almost the same time, and half as long as 
dynamic scheduling. In the last case, RDS was more efficient than RSS despite 
the calls to the global counter. This has several possible explanations. First, the 
number of nonzero elements in each column vvas more in the first and last cases 
than in the second through sixth cases. Thus, the relative overhead associated with 
the global counter versus the amount of work to do per call is less. In the last case, 
the number of nonzero elements per column varied between 8 and 10. Good load 
balancing is especially important in this case for increased throughput. Statically 
assigning tasks to processors by their depth in the dependence graph alone (as in 
RSS) cannot achieve this. There must be some way to account for the amount 
of work to be done in each task, not just the dependences of the task. I he RDS 
method performs better at this than the RSS method as shown in Test Case 7. 

When the problems have a regular sparsity structure (most of the columns/ rows 
have the same number of nonzero elements) the time to compute each q t i> roughly 
the same and the load is balanced as long as each processor gets roughly the same 
number of ry.’s to solve for. Test Cases 1-6 have a regular sparsity structure and thus 
the RSS method performed slightly better. The main contribution to this difference 
is the fact that in the RDS technique a global counter is used to maintain the queue 
of tasks. It takes about oOp-seconds for each call and this is done before each q t 
is computed. A prediction for the time difference when there is a regular sparsity 
structure is 

/ #unknowns\ . 

tRDS - tRss ^ (oO/j - seconds)! iterations) ^ # proc 7 s ' — ) ' (b) 

In Table 3 we compare the actual difference with the prediction for the first 6 cases. 
The right most column shows the number of iterations for convergence for each test 
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case. This model gives an estimate of the size of the difference that is correct to 
within a factor of two. 

4.4 Backward Solve 

The backward solve is similar to the forward solve, but there are subtle differences 
in implementation. To solve (4) we carry out the computation as a series of inner 
products rather than outer products. L r is accessed by rows since we store L by 
columns. 

An outline of our back solve procedure follows. Just as in the forward solve, 
we have a list of permuted indices back_schedule [] . It is computed in a manner 
analogous to f wd. schedule [] . back,schedule [] is treated as a queue of tasks to be 
computed by Tie processors. bck_ready[] is initialized to the value * 'WAIT* * . For 
some j. if bckjready [j] = WAIT then this indicates that z[jj has not been computed 
yet. F.ach processor gets an index from the queue as it begins the back solve. Let 
i be the index that some processor gets. For each nonzero element j in row i of 
L J . (’heck bck_ready [j] . If bckjready [j] * WAIT, then busy wait. Else, compute 
z[i] — = l\ jz[j]. When all nonzero off-diagnoal elements in row i have been used 
we calculate z[i] = z[i]/L]T and set bck.readyCi] = ' f DGNE , \ The value DONE 
indicates that the element of z[] is computed. Finally, if the queue is not empty 
get the next index. 

The C code for this technique is shown in Figure 4. As in the forward solve 
routine we compute the new vector in place, overwriting the previous entries of 
zd. in is the array containing the nonzero elements of the rows of the upper 
triangular matrix. The beginning of row i is pointed to by the array start [i]. To 
move across the nonzero elements of row T from right to left, we start at pointer = 
start [i+l] -1. start [i+i] points to in 1[] and start [i+l]-l points to 

the right-most nonzero element in row i. The bckjready [] array is set to "BUSY''’ 
during the previous forward solve. Therefore, if bckjready [j] = BUSY, then z[j] 
has not been computed yet in the back solve. To reset fwdjreadytj] for the next 
forward solve we set fwd^ready[j] = fwd_depend[j] . f wd_depend[j] is the num- 
ber of nonzero elements in row j of L. To indicate that z[j] has been computed in 
the back solve we set bckjready [j] to “DONE". The back.schedule [] array contains 
the index set that has been permuted appropriately for the back solve operation. 
Finally, we set a barrier m_synch() to synchronize all processors at the end of the 
procedure before returning. 

There is no need to do the locking and unlocking as in the forward solve routine. 
This procedure only writes to three -hared arrays fwd_ready[], bckjready []. and 
z [] . F.ach location is read by many processors but only written to by one processor. 



parallel_bckslv(l, z, row_nura, my_id, num.proc) 


double 

1[]» z C3 ; /* arrays for L and z */ 

short int 

row_num[] ; /* row_num[i] is row of element l[i] */ 

int 

my_id, num_proc; /* variables for processor # and # of processors */ 

{ 

register double 
tmp; 

register int 

row, column, pointer; 
int 

task , m_next 0 ; 

task = m__next(); ' /* get first task to do */ 

while (task <= unknowns) { 
row = back_3chedule [task] ; 
pointer ■ start [row+l] - 1; 

column * row_num [pointer] ; 

tmp = z[row]; /* copy z to local register variable */ 

while ( column > row) { 

while ( be k^ready [column] ** SPIN) continue;/* busy wait until ready */ 
tmp -■ 1 [pointer] * z [column] ; 

column 3 row^num [--po inter] ; /* get next column number */ 

> 

z[row] - tmp / 1 [pointer]; 

bck_ready [row] =* DONE; /* set flag that it is done */ 

f wd.ready [row] * f wd_depend[row] ; /* reset for next forward solve */ 

task * m_next () ; /* get next one to do */ 

> 

m_sync(); /* get all proc’s synched before returning */ 

> 


Figure l: Procedure for Parallel Backward Solve 
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Method 

Case 

Sequential 

DS 

RSS 

RDS 


t. 

h 

effic. 

h ‘ 

•ffic. 

tp 

effic. 

^ i 

2.70 

.95 

.21 

.38 

.58 

. 15 

.50 

2 

2.65 

.67 

.33 

.42 

.53 

.62 

.36 

3 

3.07 

.83 

.31 

.49 

.52 

.78 

.33 

4 

3.56 

1.01 

.29 

.55 

.54 

.81 

.37 

■5 

10.75 

2.63 

.34 

1 .54 

.58 

2.56 

.35 

6 

12.38 

7.18 

.14 

1.72 

.60 

2.35 

.14 

i 

507.90 

103.92 

.41 

85. 14 

.50 

85.80 

.49 


Table 4: Time in seconds and efficiency of parallel backward solve on 12 processors 
relative to sequential code 


Xo locking is required in this situation. The inner product form of the triangular 
solve therefore has much less overhead. 

In Table [ we compare the three methods for the backward solve. DS is clearly 
slower than the other two. It is only included for comparison. We see that the 
RSS method performs better than the other methods. Just as for the forward solve, 
the time difference is due to the fact that in RDS a global counter is required to 
maintain the queue of tasks. The difference is very pronounced for problems 1-6: 
since there is very little work to do to compute each z t ; i.e., there are only a few 
nonzero off diagonal elements in each row/column. The efficiency of RSS and RDS 
are almost identical for Test Case 7. The load balancing that is provided in RDS 
makes up for the overhead of using the global counter. The amount of work to be 
done to compute some z t is directly related to the number of nonzeros in row i of 
L. The amount of work per task affects the load balancing. Test Case 7 has the 
most variation in the number of nonzero elements in its rows (and columns). As the 
variation increases so does the need to account for this in the scheduling of tasks. 

5 Matrix- Vector Product 

In this section we discuss the implementation iff sparse matrix-vector products on 
the Sequent. We show that it is more efficient to store the whole symmetric matrix 
by rows rather than trying to save storage and storing only the lower or upper 
triangular half. This is true for both the sparse matrix- vector product and the 




triangular solves. To compute a general symmetric sparse matrix- vector product 
Ax = b on the Sequent it is more efficient to store all of .1 by rows than to store 
only the lower triangular part by rows (or columns). 

When a sparse symmetric matrix is stored as a lower triangular matrix by 
columns or rows (or if the upper triangular matrix is stored by columns or rows) 
the multiplication must be carried by a combination of inner and outer products. 
The implementation becomes complicated and requires synchronization to protect 
elements of shared arrays from being modified by more than one processor at a time. 

An implementation of a symmetric sparse matrix-vector product written in C 
is shown in Figure 5. For this example, only the nonzero elements of the lower 
triangular part of A are stored (by columns). Fir^t, each processor initializes a 
portion of the array b[] to be zero. Next, each processor gets a column of the 
data structure. This is a column of the lower triangular part of the matrix and 
a row of the upper triangular part. A column in the lower part, say column i, 
is multiplied by element x Ci] . The product is accumulated into the shared array 
b [] : b[j]+=a[pointer] xx[i] . To be sure that only one processor is writing to 
b[j] at a time we must use the system synchronization routines ST.0CKO and 
SJJNL0CKO. Next we multiply the element of column i by x[j] and add the product 
to the local variable inner_prod. When we have exhausted all elements of the upper 
triangular row, we add the local inner product into the shared array b[] using the 
appropriate locks. In essence, we accumulate inner products locally and add outer 
products globally. This approach requires two system synchronization calls per 
nonzero element in the lower triangular part of A. Even though the probability of 
collision is small since we are dealing with a sparse matrix, this has to be done to 
insure that only one processor updates an element of b[]. 

A procedure for computing a general sparse matrix-vector product where the 
full .1 is stored by rows is much simpler and is shown in Figure 6. Each processor 
computes a set of inner products. The processors dynamically get an element of b [] 
to compute using the system global counter m_next() The array row. start [] is an 
array holding the starting point for each row as it is stored in the data structure. 
The inner product of each row with x is computed and stored in b [] . This algorithm 
requires no synchronization since the work is divided into non-overlapping groups 
of rows. 

In Table 5 we compare two methods for parallel computation of the sparse ma- 
trix vector product with the time it takes to compute it sequentially. The first 
method, "Symmetric”, is the symmetric code from Figure ■*>. It takes advantage 
of symmetry and stores only the lower triangular part of the matrix. The second 
method, "Nosynch'’, is the same algorithm but we have commented out all of the 
synchronization calls to S_L0CK() and S.UNLOCKO. The answer we get is incorrect 
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raultCa, x, b, row_num, first, last, my_id, num.proc) 
double 

a[] . x[], b[] ; 
short int 

row^numE] ; 

int 

first, last, my_id, nura_proc; 

{ 

double 

x_elem, inner.prod; 
register int 
pointer, k; 
int 

row, column, ra_next(); 

for (k=f irst ; k<last; k++) b[k] - 0.0; /* zero array */ 

m__sync () ; 

column = m_next(); I* get our 1st column to start on */ 

pointer = start [column] ; /+ get position of 1st element */ 

while (column <= unknowns) { 
x.elem = xtcolumn] ; 

inner^prod * a[pointer++] * x_elem; /* compute a[i,i]*x[i] */ 

while (pointer < start [column* l] ) { 
row = row_num[pointer] ; 

S.L0CK(lp[row]) ; 

b[row] +* a[pointer] * x_elem; /* this is part of the outer prod. */ 
S^UNLOCK (lp[row] ) ; 

inner_prod += a[pointer++] *x[row] ; /* this is part of the inner prod. * 

} 

S_LDCK (IpCcolumn] ) ; 

b [column] +* inner_prod; /* store the inner product now */ 

S_UNL0CK (Ip [column] ) ; 

column = m.nextO; /* get next column to work on */ 
pointer = 3 tart [column] ; /* get pointer into array */ 

} 

m w sync(); /* wait until everyone else is done */ 

r 


Figure Code for Symmetric Sparv 1 Matrix- Vector Product 


full.mult (a, x, b, col.num, first, last) 
double 

an, 

x[], 

' b[] ; 

3hort int 

col.num [] ; 

int 

first , 
last ; 

{ 

double 

inner. prod; 
register int 
pointer , 
row , 
column; 

row ■ 1 ; 
while (row <* unknowns) { 

row = m.nextO; /* get row to work on */ 

inner.prod =* 0.0; 

/♦ 

♦ compute a [row,*] * x[*] { inner product} 

*/ 

for (pointer = row. start [row] ; pointer<row_ start [row+l] ; pointer++){ 
inner. prod +* a[pointer] * x [col. num [pointer] ] ; 

} 

b[row] 3 inner.prod; 

} 

m.syncO ; 

} 

Figure 6: Code for Full sparse matrix- vector product 


/* the nonzero entries of A */ 

/* the vector to rault by */ 

/* the result gets put here +/ 

/* array of column numbers */ 

/* first row we work on */ 

/* we do up to by not including this row */ 


/+ pointer into global DS */ 

/* row that we are working on */ 

/* column number in row that we are using */ 


19 



Method 

Ca*p 

Sequential 

Symmetric 

Nosynch 

Full 


t, 

tp 

effic. 

tp 

effic. 


effic. 

1 

0.86 

.91 

.54 

.58 

.84 

.56 

.87 

*2 

5.29 

.99 

■ 13 

.39 

.75 

.63 

.70 

3 

6.03 

1.06 

.47 

.71 

.71 

.66 

.76 

[ 

7.23 

L .3-5 

.45 

.77 

.78 

.76 

.79 

> 

21.20 

3.64 

T9 

2.41 

,73 

2.27 

.78 

0 

13.20 

3.82 

.56 

2.33 

.91 

2.55 

.84 

7 

1270. M0 

321.8.3 

.33 

226.13 

.47 

207. 16 

.51 


Fable 5: Time in seconds and efficiency of parallel Sparse Matrix- Vector product on 
12 processors relative to sequential code 


bur we stop after the same number of iterations. This is to show the impact of 
the synchronization. It also gives us a lower bound on the time for this method 
of matrix-vector product. The last method, ‘’Full", is the times from the code in 
Figure G. 

We see that storing the full matrix is best. Timing and efficiency is better 
than 70% except for large problems. We expect this, since matrix- vector products 
are very parallel computations. If we look at the difference between the parallel 
times of the Symmetric and Xosynch columns it is clear that to use the system 
synchronization calls adds almost 7)0% to the cost of the computation. But, even 
without the synchronization, we see that the Full method is better than the Nosynch 
method. From this we conclude that there is no advantage in storing only half of a 
symmetric matrix for parallel computation of the sparse mat rix- vector product on 
this machine. 

An alternative to sparse matrix vector multiplication computed as inner or outer 
products is proposed my Melhem [19] where he suggests a general technique of using 
striped matrix storage. 

6 Triangular Solve Revisited 

The decision to Tore the full matrix A affects other parts of the code. We also stored 
the full preconditioner as two triangular matrices. L and L T . both by rows. The new 
values for the timings of the triangular solves are compared with the old values in 
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Problem 


Method 


Symmetric 

Full 

fwd 

bck 

fwd 

bck 

i 

.73 

.38 

.44 

.12 

2 

.55 

.42 

.41 

.42 

3 

.72 

.49 

.53 

.48 

4 

.90 

.55 

.60 

.58 

5 

2.13 

1 .54 

1.58 

L.5 1 

6 

2.73 

1.72 

1.79 

1.70 

7 

206.81 

85.80 

200.5 

2 216.63 


Table 6: Time in seconds for Triangular Solve on L2 processors 


Table 6. The columns labeled "Symmetric'' are for storing only the lower triangular 
half of the symmetric matrix. The columns under ‘'Full are the timings for storing 
both the upper and lower triangular matrices of the preconditioner by rows. The 
forward solve is faster because it uses inner products. There is no synchronization 
for every element of L , only one for each row. We cannot, however, explain the data 
from Case 7. 

7 Parallel Efficiency of ICCG 

In Table 7 we show the time required to solve ( 1) for each implementation, assuming 
the preconditioner has be previously computed. In the first six cases it is clear that 
storing the full matrix is better than storing only its lower triangle. The efficiency 
is near or above 60% for the entire code. This is a very reasonable level and what 
we expected. But, for the seventh case, the code was not efficient. In Appendix C 
we show how the time to access array elements increases as a function of the array 
size and discuss the time for Test Case 7. 

8 Scheduling 

Other scheduling methods not considered here are discussed in [0.10.14.22]. The 
general problem is to schedule a set of partially ordered tasks onto a multiprocessor 
system so that the time required to complete the tasks is minimized. This problem is 
known to belong to the class of '‘strong" NT-hard problems. The work by [9,10.22] 
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Method 

Case 

Sequential 

Symmetric 

Nosynch 

Full 



h 

effic. 

tp 

effic. 

t P 

effic. 

1 

15.26 

2.54 

.50 

2.28 

.55 

1.89 

.67 

2 

17.80 

2.97 

.50 

2.67 

.55 

2.45 

.60 

5 

21.00 

3.89 

.45 

3.48 

.50 

3.07 

.57 

4 

22.52 

4.16 

.15 

3.43 

.54 

5.21 

.58 

5 

72.39 

L 0 . S 3 

.56 

9.58 

.62 

8.85 

.68 

6 

78.09 

1 L.01 

.59 

9.64 

.67 

8.52 

.76 

/ 

2811.11 

655.49 

.36 

609.36 

.38 

660.19 

.35 


Table 7: Time in seconds and efficiency of Total ICCG code on 12 processors relative 
to sequential code 


presents bounds on the number of processors required to compute the tasks in a 
minimum amount of time and bounds the time to compute the tasks with a fixed 
number of processors. Also, in [10], bounds on the ratio of times for two different 
feasible schedules are given. 

Kasahara and Xarita [14] present two different scheduling methods, CP/MISF 
and DF/fUS. CP/MISF stands for critical path/most immediate successors fimt 
and DF/IHS mauds for depth first/implicit heuristic search. The primary difference 
between the two is that the former schedules tasks as soon as possible and the latter 
schedules tasks as late as possible. Both require sorting of tasks at the same level 
according to the number of predessors they have and both are 0(A ~) algorithms, 
where A is the number of vertices in the dependence graph. 

We do not use either of these scheduling techniques. Sorting the tasks at each 
level is expensive. We choose a scheduling method that is not optimal but requires 
very low overhead. 

9 Summary 

We have discussed different approaches for exploiting parallelism in the ICCG method 
for solving large sparse symmetric positive definite systems of equations on a shared 
memory parallel computer. \\V showed that performing a small amount of analysis 
to determine the data dependences can drastically improve the parallel efficiency. 
Additionally, when l he sparsity structure of a triangular matrix was regular then 
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a reordered static scheduling method performed more efficiently than a reordered 
dynamic scheduling method. Finally, we showed that for the Sequent it was more 
efficient to store the whole symmetric matrix by rows rather than only the upper 
or lower triangular part. The code for a full matrix was simpler and recpiired less 
synchronization overhead. 
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A Test Problems 


For this work we have chosen 7 test cases which are representative of the class 
of problems solved by iterative methods. All are from two-dimensional domains. 
The first five are from the Harwell- Boeing collection [8] and the last two are from 
electro-magnetic analysis [■'»]. The test cases are described in Table 8. 


Case 

Ref. 

Description 

Order 

Nonzeros 

l 

[*] 

A nine point discretization of the Laplacian on 
a unit square with Dirichlet boundary condi- 
tion^ LAP30 

900 

4322 

2 

[*] 

Matrix used in modeling power system net- 
works. PSADMIT1 

6 62 

lofis 

3 

M 

Matrix used in modeling power system net- 
works. PSADMIT2 

191 

1080 

l 

[*i 

Matrix used in modeling power system net- 
works. PSADMIT3 

085 

1907 I 

) 

[s] 

Matrix used in modeling power system net- 
works. PSADMIT4 

1138 

2596 

G 

(•’1 

A firsr-order triangular finite element dis- 
cretization of the Laplacian operator on a unit 
square. 

2500 

725 1 

7 

[•">] 

Matrix from a nonlinear magnetostatic ; 
model of a permanent magnet motor, us- 
ing an unstructured finite element mesh with 
mixed triangular and quadrilateral third- 
order elements. 

6517 

69.670 


Table 8: Test Case Descriptions 



B Sequent Overview and Performance Figures 

This section provides an overview of the architecture of the Sequent- Balance 21000 
and the execution times for the operations that wore used in the? timings given in 
this paper. The architectural description is due to Osterhaug [21]. 

B.l The Sequent Architecture 

The Sequent Balance 21000 is a shared memory multiprocessor. The processors 
are identical 10- MHz National Semiconductor 32032's. These are 32-hit processors. 
They operate on a peer basis, executing a single copy of the operating systems 
executive, or "kernel”. 

There is no designated "master 1 cpu. All processors, memory modules, and i/o 
controllers plug into a single high-speed bus. There is hardware support for mutual 
exclusion - to support exclusive access to shared data structures, the system includes 
up to G4K user-accessible hardware spin-locks. 

The system we used has 12 processors and 28 Mbytes of memory. In addition, 
each cpu has 8 Kbytes of local RAM and 8 Kbytes of cache RAM. The local RAM 
holds a copy of certain frequently used kernel code and read-only kernel data struc- 
tures. The cache RAM holds blocks of system memory most recently used by the 
cpu. 


Operation 

Operand 

4-Byte Integer 

4- Byte Real 

8-Byte Real 

Addition 

4.4 

32.4 

18.9 

Multiplication 

12.7 

28.1 

20.8 

Division 

17.0 

33.0 

25.5 


Table 9: Time in ^seconds for Arithmetic Operations 


B.2 System Timing 

This section provides execution times in microseconds for a variety of operations that 
were used by the programs discussed in this paper. Times for arithmetic operations 
are shown in Table 9. These timings are computed by looping through a program 
segment 50.000 times. The time before the loop was executed was then subtracted 
from the time at the end of the loop. Some time was subtracted for loop overhead 
and then that time was divided by the number of iterations through the loop. 
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Locking and unlocking of locations in the hardware atomic fork memory was 
done by the in-line C macros S_L0CK() and SJJNLGCKO. If we assume there is no 
contention for the lock, locking and unlocking a lock takes a total of 35 microsec- 
onds. The system provided routines in the Parallel Programming Library were 
slower, taking 53 microseconds. A function. mjiextO. is provided to increment a 
global counter and return the current value. This function takes an average of 49 
microseconds per call. 



C Memory Access Times on the Sequent 

la tli«? IC'CG method every element of L . L T . and .1 are read once each iteration. 
To explain the inefficiency of Test Case 7 we ran a simple test t hat iterates over 
different, array sizes accessing each element once per iteration. We measured the 
average time to access an array element as a function of the size of the array. 
We created a program with a double-precision array, big.arrayf], with 200.000 
elements. Then, we timed the following two loops: 

for (test.size =1000; test_size< 10000 ; test. size += 1000) { 
timer (ftstart. time) ; 
for (i=0; i<200; i++) { 

for (j=0; j<test_size; j++) { /* first loop */ 

local = big. array [j] ; 

} 

> 

sep.t imer (&end_time) ; 

> 


for (test. size =10000; test.size<200001 ; 
timer (ftstart. time) ; 
for (i=0; i<200; i++) { 

for (j=0; j<test.size; j++) { 
local = big.arrayfj] ; 

} 


> 


sep.t imer (&end. time) ; 

} 


test.size += 10000) { 


/* second loop */ 


We copy the elements of the array one at a time to a scalar variable local. In 
the first loop, the number of array elements accessed, test.size. varies from 1.000 
to 9.000 in increments of 1,000. In the second loop, the number of array elements 
accessed, test.size, varies from 10,000 to 100,000 in increments of 10.000. We 
loop 200 times for each test size and divide the rime by the total number of array 
accesses. This was done 5 times for each test and the results were averaged. 

The timing routine returns both user time (utime) and system time (stime) 
separately rather than the sum of the two We used the sum in all previous timings. 
These quantities are defined as follows: 

user time the total amount of time spent executing in user mode 
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User time per memory access as a function of array size 



Array size in thousands 


Figure 7: Fser Time in /usees for Memory Access as a function of Array Size 

system time the total amount of time spent in the system on behalf of t he process. 

The results of this test are shown in Figures 7 and 8. The times are measured in 
microseconds. There is approximately a 2% increase in user time per access as the 
array size is increased from L20K to 130K. But, there is a factor of 5 increase in 
system time per access as the number of array elements in the test case increases 
from 120K to 130K. Recall, the total storage for full A in Test Case 7 is 132.823 
double-precision numbers. The Sequent takes longer to access each element for this 
large problem than for all the other test cases. 

In Table 10 we show separate entries for the user time and the system time for the 
sequential, symmetric and full matrix implementations of Test Case 7. The feature 
to notice is the drastic increase in system time for the forward and backward solve 
in the 'Full" case as compared with the corresponding times of the ‘'Symmetric" 
and the "Sequent iaP implementations. This is partially explained by the test loops 
above. We see that large problems such as Test Case 7 are not efficient on the 
Sequent. 


30 





utime 

73.33 

72. 1 7 

173.28 

349.04 

full 

stime 

127.19 

144-4 $ 

34.18 

310.25 

( parallel) 

totals 

200.52 

216.63 

207.46 

660. L9 


Table 10: Detailed timing of Case 7 in seconds 
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